ABSTRACT

Characterizing motif (i.e., locally connected subgraph pattern) statistics is important for understanding complex networks such as online social networks and communication networks. Previous work made the strong assumption that the graph topology of interest is known, and that the dataset either fits into main memory or is stored on disk such that it is not expensive to obtain all neighbors of any given node. In practice, researchers have to deal with the situation where the graph topology is unknown, either because the graph is dynamic, or because it is expensive to collect and store all topological and meta information on disk. Hence, what is available to researchers is only a snapshot of the graph generated by sampling edges from the graph at random, which we call a "RESampled graph". Clearly, a RESampled graph's motif statistics may be quite different from those of the underlying original graph. To solve this challenge, we propose a framework and implement a system called Minfer, which can take the given RESampled graph and accurately infer the underlying graph's motif statistics. We also use Fisher information to bound the errors of our estimates. Experiments using large scale datasets show our method to be accurate.

1. INTRODUCTION 

Complex networks are widely studied across many fields of science and technology, from physics to biology, and from nature to society. Networks that have similar global topological features, such as degree distribution and graph diameter, can exhibit significant differences in their local structures. There is a growing interest in exploring these local structures (also known as "motifs"), which are small connected subgraph patterns that form during the growth of a network. Motifs have many applications; for example, they are used to characterize communication and evolution patterns in online social networks (OSNs) [7, 14], pattern recognition in gene expression profiling, protein-protein interaction prediction, and coarse-grained topology generation of networks [11]. For instance, 3-node motifs such as "the friend of my friend is my friend" and "the enemy of my enemy is my friend" are well known evolution patterns in signed (i.e., friend/foe) social networks. Kunegis et al. considered the significance of motifs in Slashdot Zoo and how they impact the stability of signed networks. Other more complex examples include 4-node motifs such as bi-fans and bi-parallels defined in [20].

Although motifs are important characteristics that help researchers understand the underlying network, one major technical hurdle is that it is computationally expensive to compute motif frequencies, since this requires one to enumerate and count all subgraphs in a network, and there exist a large number of subgraphs even for a medium-size network with fewer than one million edges. For example, the graphs Slashdot and Epinions [25], which contain approximately $1.0 \times 10^5$ nodes and $1.0 \times 10^6$ edges, have more than $2.0 \times 10^{10}$ 4-node connected and induced subgraphs. To address this problem, several sampling methods have been proposed to estimate the frequency distribution of motifs [13, 17, 4]. All these methods require that the entire graph topology fit into memory, or that an I/O-efficient neighbor query API be available so that one can explore the graph topology stored on disk. In summary, previous work focuses on designing computationally efficient methods to characterize motifs when the entire graph of interest is given.


Figure 1: An example of the available RESampled graph $G^*$ and the underlying graph $G$.


In practice the graph of interest may not be known, but instead 
the available dataset is a subgraph sampled from the original graph. 
This can be due to the following reasons: 

• Data collection: Sampling is inevitable for collecting a large dynamic graph given as a high speed stream of edges. For example, sampling is used to collect network traffic on backbone routers in order to study the network graph, where a node in the graph represents a host and an edge $(u, v)$ represents a connection from host $u$ to host $v$, because capturing the entire traffic is prohibitive given the high traffic speed and the limited resources (e.g., memory and computation) of network devices.

• Data transportation: Sampling may also be required to reduce the high cost of transporting an entire dataset to a remote data analysis center.

• Memory and computation: Sometimes the graph of interest is given in a memory-expensive format such as a raw text file, and may be too large to fit into memory. Moreover, it may be too expensive to preprocess and organize it on disk. In such cases, it may be useful to build a relatively small graph consisting of edges sampled from the graph file at random, and compute its motif statistics in memory.

A simple example is given in Fig. 1, where the sampled graph $G^*$ is derived from the dataset representing $G$. In this work, we assume the available graph $G^*$ is obtained through random edge sampling (i.e., each edge is independently sampled with the same probability $0 < p < 1$), which is popular and easy to implement in practice. Formally, we denote the graph $G^*$ as a RESampled graph of $G$. One can easily see that a RESampled graph's motif statistics will differ from those of the original graph due to uncertainties introduced by sampling. For example, Fig. 2 shows that $s^*$ is a 4-node induced subgraph in the RESampled graph $G^*$, and we do not know from which original induced subgraph $s$ in $G$ it derives: $s$ could be any one of the five subgraphs depicted in Fig. 2.
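As a rough illustration of random edge sampling, the following minimal sketch keeps each edge independently with probability $p$. The function name `resample_edges`, the toy edge list, and the optional seed are assumptions made for this example only; they are not part of Minfer.

```python
import random

def resample_edges(edges, p, seed=None):
    """Keep each edge independently with probability p (random edge sampling)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    rng = random.Random(seed)  # seeded generator so that runs are reproducible
    return [edge for edge in edges if rng.random() < p]

# Hypothetical toy graph G: a 4-cycle with one chord.
G_edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
G_star_edges = resample_edges(G_edges, p=0.5, seed=7)
```

The value $p = 1$ is accepted here only so that the degenerate case (every edge kept) can be checked; the setting studied in this paper is $0 < p < 1$.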



Figure 2: $s^*$ is a 4-node induced subgraph in the RESampled graph $G^*$, and $s$ is the original induced subgraph of $s^*$ in the original graph $G$.

Unlike previous methods [13, 17, 4], we aim to design an accurate method to infer motif statistics of the original graph $G$ from the available RESampled graph $G^*$. These previous methods focus on designing computationally efficient sampling methods based on sampling induced subgraphs in $G$ to avoid the problem shown in Fig. 2. Hence they fail to infer the underlying graph's motif statistics from the given RESampled graph. The gSH method can be used to estimate the number of connected subgraphs from sampled edges. However, it cannot be applied to characterize motifs, i.e., connected and induced subgraphs (or CISes), because motif statistics can differ from connected subgraphs' statistics. For example, Fig. 3 shows that 75% of a graph's 4-node connected subgraphs are isomorphic to a 4-node line (i.e., the first motif in Fig. 4(b)), while 50% of its 4-node CISes are isomorphic to a 4-node line.

Contribution: Our contribution can be summarized as follows. To the best of our knowledge, we are the first to study and provide an accurate and efficient solution to estimate motif statistics from a given RESampled graph. We introduce a probabilistic model to study the relationship between motifs in the RESampled graph and in the underlying graph. Based on this model, we propose an accurate method, Minfer, to infer the underlying graph's motif statistics from the RESampled graph. We also provide a Fisher information based method to bound the error of our estimates. Experiments on a variety of real world datasets show that our method can accurately estimate the motif statistics of a graph based on a small RESampled graph.

Figure 3: 4-node CISes vs. 4-node connected subgraphs.

This paper is organized as follows: The problem formulation is presented in Section 2. Section 3 presents our method (i.e., Minfer) for inferring subgraph class concentrations of the graph under study from a given RESampled graph. Section 4 presents methods for computing the given RESampled graph's motif statistics. The performance evaluation and testing results are presented in Section 5. Section 6 summarizes related work. Concluding remarks then follow.

2. PROBLEM FORMULATION 

In this section, we first introduce the concept of motif concentration, then we discuss the challenges of computing motif concentrations in practice.

Denote the underlying graph of interest as a labeled undirected graph $G = (V, E, L)$, where $V$ is a set of nodes, $E$ is a set of undirected edges, $E \subseteq V \times V$, and $L$ is a set of labels $l_{u,v}$ associated with edges $(u, v) \in E$. For example, we attach a label $l_{u,v} \in \{\to, \leftarrow, \leftrightarrow\}$ to indicate the direction of the edge $(u, v) \in E$ for a directed network. Edges may have other labels too; for instance, in a signed network, edges have positive or negative labels to represent friend or foe relationships. If $L$ is empty, then $G$ is an unlabeled undirected graph, which is equivalent to the regular undirected graph.

Motif concentration is a metric that represents the distribution of various subgraph patterns that appear in $G$. To illustrate, we show the 3-, 4- and 5-node subgraph patterns in Figs. 4, 5, and 6. To define motif concentration formally, we first need to introduce some notation. For ease of presentation, Table 1 depicts the notation used in this paper.

Table 1: Table of notations.

Notation                                  Description
$G$                                       $G = (V, E, L)$ is the graph under study
$G^*$                                     $G^* = (V^*, E^*, L^*)$ is a RESampled graph
$V(s)$, $s \in C^{(k)}$                   set of nodes for $k$-node CIS $s$
$E(s)$, $s \in C^{(k)}$                   set of edges for $k$-node CIS $s$
$M(s)$                                    associated motif of CIS $s$
$T_k$                                     number of $k$-node motif classes
$M_i^{(k)}$                               $i$-th $k$-node motif
$C^{(k)}$                                 set of $k$-node CISes in $G$
$C_i^{(k)}$                               set of CISes in $G$ isomorphic to $M_i^{(k)}$
$n^{(k)} = |C^{(k)}|$                     number of $k$-node CISes in $G$
$n_i^{(k)} = |C_i^{(k)}|$                 number of CISes in $G$ isomorphic to $M_i^{(k)}$
$m_i^{(k)}$                               number of CISes in $G^*$ isomorphic to $M_i^{(k)}$
$\omega_i^{(k)} = n_i^{(k)} / n^{(k)}$    concentration of motif $M_i^{(k)}$ in $G$
$P$                                       matrix $P = [P_{i,j}]_{1 \le i,j \le T_k}$
$P_{i,j}$                                 probability that a $k$-node CIS $s^*$ in $G^*$ is isomorphic to $M_i^{(k)}$ given that its original CIS $s$ in $G$ is isomorphic to $M_j^{(k)}$
$\phi_{i,j}$                              number of subgraphs of $M_j^{(k)}$ isomorphic to $M_i^{(k)}$
$\mathbf{n}^{(k)}$                        $\mathbf{n}^{(k)} = (n_1^{(k)}, \ldots, n_{T_k}^{(k)})^T$
$\mathbf{m}^{(k)}$                        $\mathbf{m}^{(k)} = (m_1^{(k)}, \ldots, m_{T_k}^{(k)})^T$
$\rho_i^{(k)} = m_i^{(k)} / m^{(k)}$      concentration of motif $M_i^{(k)}$ in $G^*$, where $m^{(k)} = \sum_{i=1}^{T_k} m_i^{(k)}$
$p$                                       probability of sampling an edge
$q$                                       $q = 1 - p$

An induced subgraph of $G$, $G' = (V', E', L')$, $V' \subset V$, $E' \subset E$ and $L' \subset L$, is a subgraph whose edges and associated labels are all in $G$, i.e., $E' = \{(u, v) : u, v \in V', (u, v) \in E\}$, $L' = \{l_{u,v} : u, v \in V', (u, v) \in E\}$. We define $C^{(k)}$ as the set of all connected induced subgraphs (CISes) with $k$ nodes in $G$, and denote $n^{(k)} = |C^{(k)}|$. For example, Fig. 3 depicts all possible 4-node CISes. Let $T_k$ denote the number of $k$-node motifs and $M_i^{(k)}$ denote the $i$-th $k$-node motif. For example, $T_4 = 6$ and $M_1^{(4)}, \ldots, M_6^{(4)}$ are the six 4-node undirected motifs depicted in Fig. 4(b). Then we partition $C^{(k)}$ into $T_k$ equivalence classes, or $C_1^{(k)}, \ldots, C_{T_k}^{(k)}$, where CISes within $C_i^{(k)}$ are isomorphic to $M_i^{(k)}$.




Let $n_i^{(k)}$ denote the frequency of the motif $M_i^{(k)}$, i.e., the number of the CISes in $G$ isomorphic to $M_i^{(k)}$. Formally, we have $n_i^{(k)} = |C_i^{(k)}|$, which is the number of CISes in $C_i^{(k)}$. Then the concentration of $M_i^{(k)}$ is defined as

$$\omega_i^{(k)} = \frac{n_i^{(k)}}{n^{(k)}}.$$

Thus, $\omega_i^{(k)}$ is the fraction of $k$-node CISes isomorphic to the motif $M_i^{(k)}$ among all $k$-node CISes. In this paper, we make the following assumptions:

• Assumption 1: The complete $G$ is not available to us, but a RESampled graph $G^* = (V^*, E^*, L^*)$ of $G$ is given, where $V^*$, $E^*$, and $L^*$ are the node, edge, and edge label sets of $G^*$ respectively. $G^*$ is obtained by random edge sampling, i.e., each edge in $E$ is independently sampled with the same probability $0 < p < 1$, where $p$ is known in advance.

• Assumption 2: The label of a sampled edge $(u, v) \in E^*$ is the same as that of $(u, v)$ in $G$, i.e., $l^*_{u,v} = l_{u,v}$.

These two assumptions are satisfied by many applications' data collection procedures. For instance, the source data of online applications such as network traffic monitoring is given as a stream of directed edges, and the following simple method is computationally and memory efficient for collecting edges and generating a small RESampled graph, which will be sent to a remote network traffic analysis center: each incoming directed edge $u \to v$ is sampled when $\tau(u, v) < \rho p$, where $\rho$ is an integer (e.g., 10,000) and $\tau(u, v)$ is a hash function satisfying $\tau(u, v) = \tau(v, u)$ and mapping edges into integers $0, 1, \ldots, \rho - 1$ uniformly. The property $\tau(u, v) = \tau(v, u)$ guarantees that edges $u \to v$ and $u \leftarrow v$ are sampled or discarded simultaneously. Hence the label of a sampled edge $(u, v) \in E^*$ is the same as that of $(u, v)$ in $G$. Using universal hashing, a simple instance of $\tau(u, v)$ is given as the following function when each $v \in V$ is an integer smaller than $\Delta$:

$$\tau(u, v) = \big(a(\min\{u, v\}\Delta + \max\{u, v\}) + b\big) \bmod \gamma \bmod \rho,$$

where $\gamma$ is a prime larger than $\Delta^2$, and $a$ and $b$ are any integers with $a \in \{1, \ldots, \rho - 1\}$ and $b \in \{0, \ldots, \rho - 1\}$. We can easily find that $\tau(u, v) = \tau(v, u)$ and that $\tau(u, v)$ maps edges into integers $0, 1, \ldots, \rho - 1$ uniformly. The computational and space complexities of the above sampling method are both $O(1)$, which makes it easy to use for data collection in practice. As alluded to before, in this paper we aim to accurately infer the motif concentrations of $G$ based on the given RESampled graph $G^*$.

3. MOTIF STATISTICAL INFERENCE

The motif statistics of the RESampled graph $G^*$ and the original graph $G$ can be quite different. In this section, we introduce a probabilistic model to bridge the gap between the motif statistics of $G^*$ and $G$. Using this model, we will show there exists a simple and concise relationship between the motif statistics of $G$ and $G^*$. We then propose an efficient method to infer the motif concentration of $G$ from $G^*$. Finally, we also give a method to construct confidence intervals of our estimates of motif concentrations.

Figure 6: All classes of three-node signed and undirected motifs (the numbers are the motif IDs).
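The hash-based edge sampling rule from the problem formulation (sample a directed edge when its symmetric hash value falls below $\rho p$) might be sketched as follows. The helper names `make_tau` and `sample_stream`, the toy stream, and the concrete parameter values are assumptions chosen for illustration; only the form of $\tau(u, v)$ comes from the text.

```python
def make_tau(delta, gamma, a, b, rho):
    """Return the symmetric edge hash tau(u, v) with values in {0, ..., rho - 1}."""
    def tau(u, v):
        lo, hi = min(u, v), max(u, v)  # order-independent encoding of the edge
        return ((a * (lo * delta + hi) + b) % gamma) % rho
    return tau

def sample_stream(edge_stream, tau, rho, p):
    """Yield each directed edge u -> v whose hash value falls below rho * p."""
    threshold = rho * p
    for u, v in edge_stream:
        if tau(u, v) < threshold:
            yield (u, v)

# Illustrative parameters (assumptions): node IDs are below delta = 1000,
# gamma = 1_000_003 is taken as a prime larger than delta**2, and rho = 10_000.
tau = make_tau(delta=1000, gamma=1_000_003, a=1234, b=567, rho=10_000)
stream = [(1, 2), (2, 1), (5, 9), (9, 5), (7, 3)]
sampled = list(sample_stream(stream, tau, rho=10_000, p=0.3))
```

Because `tau` depends only on the unordered pair, both directions of an edge are either kept or dropped together, and each edge is processed in constant time without any per-edge state.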


3.1 Probabilistic Model of Motifs in G* and G 

To estimate the motif statistics of $G$ based on $G^*$, we develop a probabilistic method to model the relationship between the motifs in $G^*$ and $G$. Define $P = [P_{i,j}]$, where $P_{i,j}$ is the probability that $s^*$ is isomorphic to motif $M_i^{(k)}$ given that $s$ is isomorphic to motif $M_j^{(k)}$, i.e.,

$$P_{i,j} = P\big(M(s^*) = M_i^{(k)} \mid M(s) = M_j^{(k)}\big).$$

To obtain $P_{i,j}$, we first compute $\phi_{i,j}$, which is the number of subgraphs of $M_j^{(k)}$ isomorphic to $M_i^{(k)}$. For example, $M_2^{(3)}$ (i.e., the triangle) includes three subgraphs isomorphic to $M_1^{(3)}$ (i.e., the wedge) for the undirected graph shown in Fig. 4(a). Thus, we have $\phi_{1,2} = 3$ for 3-node undirected motifs. When $i = j$, $\phi_{i,j} = 1$. It is not easy to compute $\phi_{i,j}$ manually for 4- and 5-node motifs. Hence we provide a simple method to compute $\phi_{i,j}$ in Algorithm 1. The computational complexity is $O(k^2 k!)$. Note that the cost of computing $\phi_{i,j}$ is not a big concern because in practice, $k$ is usually five or less for motif discovery. Denote by $V(s)$ and $E(s)$ the sets of nodes and edges in subgraph $s$ respectively. We have the following equation

$$P_{i,j} = \phi_{i,j}\, p^{|E(M_i^{(k)})|}\, q^{|E(M_j^{(k)})| - |E(M_i^{(k)})|},$$

where $q = 1 - p$. For all CISes in $G$ isomorphic to $M_j^{(k)}$, the above model tells us that approximately $P_{i,j} \times 100\%$ of these CISes are expected to appear as CISes isomorphic to $M_i^{(k)}$ in $G^*$.

Figure 4: All classes of three-node, four-node, and five-node undirected and connected motifs (the numbers are the motif IDs).

Figure 5: All classes of three-node directed and connected motifs (the numbers are the motif IDs).


Algorithm 1: Pseudo-code of computing $\phi_{i,j}$, i.e., the number of subgraphs of $M_j^{(k)}$ that are isomorphic to $M_i^{(k)}$.

1: Step 1: Generate two graphs $\hat{G} = (\{v_1, \ldots, v_k\}, \hat{E}, \hat{L})$ and $\tilde{G} = (\{u_1, \ldots, u_k\}, \tilde{E}, \tilde{L})$, isomorphic to the motifs $M_i^{(k)}$ and $M_j^{(k)}$ respectively, where $\hat{E}$ and $\hat{L}$ are the edges and edge labels of nodes $\{v_1, \ldots, v_k\}$, and $\tilde{E}$ and $\tilde{L}$ are the edges and edge labels of nodes $\{u_1, \ldots, u_k\}$.

2: Step 2: Initialize a counter $y_{i,j} = 0$. For each permutation $(x_1, \ldots, x_k)$ of integers $1, \ldots, k$, $y_{i,j}$ remains unchanged when there exists an edge $(v_a, v_b) \in \hat{E}$ satisfying $(u_{x_a}, u_{x_b}) \notin \tilde{E}$ or $l_{v_a,v_b} \ne l_{u_{x_a},u_{x_b}}$, and $y_{i,j} = y_{i,j} + 1$ otherwise.

3: Step 3: Initialize a counter $z_i = 0$. For each permutation $(x_1, \ldots, x_k)$ of integers $1, \ldots, k$, $z_i$ remains unchanged when there exists an edge $(v_a, v_b) \in \hat{E}$ satisfying $(v_{x_a}, v_{x_b}) \notin \hat{E}$ or $l_{v_a,v_b} \ne l_{v_{x_a},v_{x_b}}$, and $z_i = z_i + 1$ otherwise.

4: Step 4: Finally, $\phi_{i,j} = y_{i,j} / z_i$.
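The permutation-counting idea of Algorithm 1 can be sketched for unlabeled undirected motifs as below. This is a simplified illustration under stated assumptions: edge labels are ignored, motifs are given as edge lists over nodes `0..k-1`, and the function names `count_embeddings` and `phi` are hypothetical.

```python
from itertools import permutations

def count_embeddings(pattern_edges, target_edges, k):
    """Count permutations of 0..k-1 that map every pattern edge onto a target edge."""
    target = {frozenset(edge) for edge in target_edges}
    count = 0
    for perm in permutations(range(k)):
        if all(frozenset((perm[a], perm[b])) in target for a, b in pattern_edges):
            count += 1
    return count

def phi(pattern_edges, target_edges, k):
    """Number of subgraphs of the target motif isomorphic to the pattern motif."""
    y = count_embeddings(pattern_edges, target_edges, k)   # Step 2: edge-preserving maps
    z = count_embeddings(pattern_edges, pattern_edges, k)  # Step 3: automorphisms of the pattern
    return y // z                                          # Step 4

wedge = [(0, 1), (1, 2)]
triangle = [(0, 1), (1, 2), (0, 2)]
print(phi(wedge, triangle, 3))  # prints 3
```

Dividing by the automorphism count removes the permutations that map the pattern onto the same subgraph, so a triangle contributes three distinct wedges rather than six.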


3.2 Motif Concentration Estimation 

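Using the above probabilistic model, we propose a method, Minfer, to estimate motif statistics of $G$ from $G^*$. Denote by $m_i^{(k)}$, $1 \le i \le T_k$, $k = 3, 4, \ldots$, the number of CISes in $G^*$ isomorphic to the motif $M_i^{(k)}$. The method to compute $m_i^{(k)}$ is presented in the next section. Then, the expectation of $m_i^{(k)}$ is computed as

$$E[m_i^{(k)}] = \sum_{1 \le j \le T_k} n_j^{(k)} P_{i,j}. \tag{1}$$

In matrix notation, Equation (1) can be expressed as

$$E[\mathbf{m}^{(k)}] = P\,\mathbf{n}^{(k)},$$

where $P = [P_{i,j}]_{1 \le i,j \le T_k}$, $\mathbf{n}^{(k)} = (n_1^{(k)}, \ldots, n_{T_k}^{(k)})^T$, and $\mathbf{m}^{(k)} = (m_1^{(k)}, \ldots, m_{T_k}^{(k)})^T$. Then, we have

$$\mathbf{n}^{(k)} = P^{-1} E[\mathbf{m}^{(k)}].$$

Thus, we estimate $\mathbf{n}^{(k)}$ as

$$\hat{\mathbf{n}}^{(k)} = P^{-1}\mathbf{m}^{(k)},$$

where $\hat{\mathbf{n}}^{(k)} = (\hat{n}_1^{(k)}, \ldots, \hat{n}_{T_k}^{(k)})^T$. We easily have

$$E[\hat{\mathbf{n}}^{(k)}] = E[P^{-1}\mathbf{m}^{(k)}] = P^{-1}E[\mathbf{m}^{(k)}] = \mathbf{n}^{(k)},$$

therefore $\hat{\mathbf{n}}^{(k)}$ is an unbiased estimator of $\mathbf{n}^{(k)}$. Finally, we estimate $\omega_i^{(k)}$ as follows

$$\hat{\omega}_i^{(k)} = \frac{\hat{n}_i^{(k)}}{\sum_{j=1}^{T_k} \hat{n}_j^{(k)}}, \quad 1 \le i \le T_k. \tag{2}$$

Denote by $\rho_i^{(k)}$ the concentration of motif $M_i^{(k)}$ in $G^*$, i.e., $\rho_i^{(k)} = m_i^{(k)} / \sum_{j=1}^{T_k} m_j^{(k)}$. Then we observe that (2) is equivalent to the following equation, which directly describes the relationship between motif concentrations of $G$ and $G^*$. Let $\hat{\boldsymbol{\omega}}^{(k)} = [\hat{\omega}_1^{(k)}, \ldots, \hat{\omega}_{T_k}^{(k)}]^T$ and $\boldsymbol{\rho}^{(k)} = [\rho_1^{(k)}, \ldots, \rho_{T_k}^{(k)}]^T$. Then

$$\hat{\boldsymbol{\omega}}^{(k)} = \frac{P^{-1}\boldsymbol{\rho}^{(k)}}{W}, \tag{3}$$

where $W = [1, \ldots, 1]\,P^{-1}\boldsymbol{\rho}^{(k)}$ is a normalizer. For 3-node undirected motifs, $P$ is computed as

$$P = \begin{pmatrix} p^2 & 3qp^2 \\ 0 & p^3 \end{pmatrix},$$

and the inverse of $P$ is

$$P^{-1} = \begin{pmatrix} p^{-2} & -3qp^{-3} \\ 0 & p^{-3} \end{pmatrix}.$$

Expressions for $P$ and $P^{-1}$ for 3-node signed undirected motifs, 3-node directed motifs, 4-node undirected motifs, and 5-node undirected motifs can be found in the Appendix.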
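For 3-node undirected motifs, the estimator can be sketched by solving $\mathbf{m} = P\mathbf{n}$ directly. The sketch assumes that motif 1 is the wedge and motif 2 the triangle, so that $P_{1,1} = p^2$, $P_{1,2} = 3qp^2$, $P_{2,1} = 0$, and $P_{2,2} = p^3$ follow from the formula for $P_{i,j}$ with $\phi_{1,2} = 3$. The function name `minfer_3node` and the input counts are illustrative assumptions.

```python
from fractions import Fraction

def minfer_3node(m_wedge, m_triangle, p):
    """Estimate wedge/triangle counts and concentrations in G from counts in G*."""
    q = 1 - p
    # Back-substitution on the upper-triangular system m = P n.
    n_triangle = m_triangle / p**3
    n_wedge = (m_wedge - 3 * q * p**2 * n_triangle) / p**2
    total = n_wedge + n_triangle
    return (n_wedge, n_triangle), (n_wedge / total, n_triangle / total)

# Hypothetical counts in G*: the exact expectations for a graph with
# 100 wedges and 10 triangles sampled with p = 1/2.
p = Fraction(1, 2)
n_hat, omega_hat = minfer_3node(Fraction(115, 4), Fraction(5, 4), p)
```

With these inputs the sketch recovers 100 wedges and 10 triangles, which is the unbiasedness property in miniature. With real sampled counts the estimates fluctuate around the true values, and the sketch does not guard against a zero total.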


3.3 Lower Bound on Estimation Errors 

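It is difficult to directly analyze the errors of our estimate $\hat{\boldsymbol{\omega}}^{(k)}$, because it is complex to model the dependence of sampled CISes due to their shared edges and nodes. Instead, we derive a lower bound on the mean squared error (MSE) of $\hat{\boldsymbol{\omega}}^{(k)}$ using the Cramér-Rao lower bound (CRLB) of $\hat{\boldsymbol{\omega}}^{(k)}$, which gives the smallest MSE that any unbiased estimator of $\boldsymbol{\omega}^{(k)}$ can achieve. For a $k$-node CIS $s$ selected from the $k$-node CISes of $G$ at random, the probability that $s$ is isomorphic to the $j$-th $k$-node motif is $P(M(s) = M_j^{(k)}) = \omega_j^{(k)}$. Let $s^*$ be the induced subgraph of the node set $V(s)$ in the RESampled graph $G^*$. Clearly, $s^*$ may not be connected. Furthermore, there may exist nodes in $V(s)$ that are not present in $G^*$. We say $s^*$ is evaporated in $G^*$ for these two scenarios. Let $P_{0,j}$ denote the probability that $s^*$ is evaporated given that its original CIS $s$ is isomorphic to the $j$-th $k$-node motif. Then, we have

$$P_{0,j} = 1 - \sum_{l=1}^{T_k} P_{l,j}.$$

For a random $k$-node CIS $s$ of $G$, the probability that its associated $s^*$ in $G^*$ is isomorphic to the $i$-th $k$-node motif is

$$\xi_i = P\big(M(s^*) = M_i^{(k)}\big) = \sum_{j=1}^{T_k} P_{i,j}\,\omega_j^{(k)}, \quad 1 \le i \le T_k,$$

and the probability that $s^*$ is evaporated is

$$\xi_0 = \sum_{j=1}^{T_k} P_{0,j}\,\omega_j^{(k)}.$$

When $s^*$ is evaporated, we denote $M(s^*) = 0$. Then, the likelihood function of $M(s^*)$ with respect to $\boldsymbol{\omega}^{(k)}$ is

$$f(i \mid \boldsymbol{\omega}^{(k)}) = \xi_i, \quad 0 \le i \le T_k.$$

The Fisher information of $M(s^*)$ with respect to $\boldsymbol{\omega}^{(k)}$ is defined as a matrix $J = [J_{i,j}]_{1 \le i,j \le T_k}$, where

$$J_{i,j} = E\left[\frac{\partial \ln f(l \mid \boldsymbol{\omega}^{(k)})}{\partial \omega_i^{(k)}}\,\frac{\partial \ln f(l \mid \boldsymbol{\omega}^{(k)})}{\partial \omega_j^{(k)}}\right] = \sum_{l=0}^{T_k} \frac{\partial \ln f(l \mid \boldsymbol{\omega}^{(k)})}{\partial \omega_i^{(k)}}\,\frac{\partial \ln f(l \mid \boldsymbol{\omega}^{(k)})}{\partial \omega_j^{(k)}}\, f(l \mid \boldsymbol{\omega}^{(k)}) = \sum_{l=0}^{T_k} \frac{P_{l,i} P_{l,j}}{\xi_l}.$$

For simplicity, we assume that the CISes of $G^*$ are independent (i.e., have no overlapping edges). Then the Fisher information matrix of all $k$-node CISes is $n^{(k)} J$. The Cramér-Rao theorem states that the MSE of any unbiased estimator is lower bounded by the inverse of the Fisher information matrix, i.e.,

$$\mathrm{MSE}(\hat{\omega}_i^{(k)}) = E\big[(\hat{\omega}_i^{(k)} - \omega_i^{(k)})^2\big] \ge \frac{\big(J^{-1} - \boldsymbol{\omega}^{(k)}(\boldsymbol{\omega}^{(k)})^T\big)_{i,i}}{n^{(k)}},$$

provided some weak regularity conditions hold [24]. Here the term $\boldsymbol{\omega}^{(k)}(\boldsymbol{\omega}^{(k)})^T$ corresponds to the accuracy gain obtained by accounting for the constraint $\sum_{i=1}^{T_k} \omega_i^{(k)} = 1$.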
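A numerical sketch of this lower bound for 3-node undirected motifs is given below. It rests on several assumptions that should be read as such: the per-CIS Fisher information is taken to be $J_{i,j} = \sum_{l} P_{l,i}P_{l,j}/\xi_l$, where $\xi_l$ is the probability of observing outcome $l$ (with $l = 0$ meaning "evaporated"); the bound is taken as $\big((J^{-1})_{i,i} - \omega_i^2\big)/n$; motif 1 is the wedge and motif 2 the triangle. The function name `crlb_3node` is hypothetical.

```python
from fractions import Fraction

def crlb_3node(omega, p, n):
    """Lower bounds on the MSE of the wedge and triangle concentration estimates."""
    q = 1 - p
    P = [[p**2, 3 * q * p**2], [0, p**3]]
    P0 = [1 - P[0][j] - P[1][j] for j in range(2)]   # evaporation probabilities
    rows = [P0] + P                                  # outcomes l = 0, 1, 2
    xi = [sum(row[j] * omega[j] for j in range(2)) for row in rows]
    # Fisher information; outcomes with zero probability contribute nothing.
    J = [[sum(row[i] * row[j] / x for row, x in zip(rows, xi) if x != 0)
          for j in range(2)] for i in range(2)]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    inv_diag = [J[1][1] / det, J[0][0] / det]        # diagonal of the 2x2 inverse
    return [(inv_diag[i] - omega[i] ** 2) / n for i in range(2)]

omega = (Fraction(3, 4), Fraction(1, 4))
bounds = crlb_3node(omega, Fraction(1, 2), n=100)
```

As a sanity check on the assumed form, setting $p = 1$ (no sampling) reduces the bound to $\omega_i(1 - \omega_i)/n$, the variance of a multinomial proportion.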


4. 3-, 4-, AND 5-NODE CIS ENUMERATION 

The existing generalized graph enumeration method can be used for enumerating all $k$-node CISes in the RESampled graph $G^*$; however, it is complex to apply and is inefficient for small values of $k = 3, 4, 5$. In this section, we first present a method (an extension of the NodeIterator++ method) to enumerate all 3-node CISes in $G^*$. Then, we propose new methods to enumerate 4- and 5-node CISes in $G^*$ respectively. In what follows we denote $N^*(u)$ as the neighbors of $u$ in $G^*$. Note that in this section $G^*$ is the default graph when a function's underlying graph is omitted for simplicity. For example, the CIS with nodes $u$, $v$, and $w$ refers to the CIS with nodes $u$, $v$, and $w$ in $G^*$.

4.1 3-node CIS Enumeration 

Algorithm 2 shows our 3-node CIS enumeration method. Similar to the NodeIterator++ method, we "pivot" (the associated operation is discussed later) each node $u \in V^*$ to enumerate CISes including $u$. For any two neighbors $v$ and $w$ of $u$, we can easily find that the induced graph $s$ with nodes $u$, $v$ and $w$ is a 3-node CIS. Thus, we enumerate all pairs of two nodes in $N^*(u)$, and update their associated 3-node CIS for $u$. We call this process "pivoting" $u$ for 3-node CISes.

Clearly, a 3-node CIS $s$ is counted three times when the associated undirected graph of $s$, obtained by discarding edge labels, is isomorphic to a triangle: once by pivoting each of the nodes $u$, $v$, and $w$. Let $\succ$ be a total order on all of the nodes, which can be easily defined and obtained, e.g., from array position or pointer addresses. To ensure each CIS is enumerated once and only once, we let one and only one node in each CIS be "responsible" for making sure the CIS gets counted. When we "pivot" $u$ and enumerate a CIS $s$, $s$ is counted if $u$ is the "responsible" node of $s$. Otherwise, $s$ is discarded and not counted. We use the same method as in [27], i.e., we let the node with lowest order in a CIS whose associated undirected graph is isomorphic to a triangle be the "responsible" node. For the other classes of CISes, their associated undirected graphs are isomorphic to an unclosed wedge, i.e., the first motif in Fig. 4(a). For each of these CISes, we let the node in the middle of its associated undirected graph (e.g., the node with degree 2 in the unclosed wedge) be the "responsible" node.
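The pivoting rule for 3-node CISes can be sketched for an unlabeled undirected graph as follows. The adjacency-dictionary representation, the use of integer node IDs as the total order, and the function names are assumptions made for this illustration; edge labels are ignored.

```python
from collections import Counter
from itertools import combinations

def build_adj(edges):
    """Build a node -> neighbor-set dictionary from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def count_3node_cises(adj):
    """Count wedges and triangles, each exactly once, by pivoting every node."""
    counts = Counter()
    for u in adj:                                      # pivot u
        for v, w in combinations(sorted(adj[u]), 2):   # neighbor pairs, v before w
            is_triangle = w in adj[v]
            if is_triangle and u > v:
                continue   # triangle: only its lowest-order node is responsible
            counts["triangle" if is_triangle else "wedge"] += 1
    return counts

adj = build_adj([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])
print(count_3node_cises(adj))  # prints Counter({'wedge': 3, 'triangle': 1})
```

A wedge is only ever generated when its middle node is the pivot, so no extra check is needed for it; the order test is needed only for triangles, which every one of their three nodes generates.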

Algorithm 2: 3-node CIS enumeration.

input : $G^* = (V^*, E^*, L^*)$
/* $m_i^{(3)}$ records the number of CISes in $G^*$ isomorphic to motif $M_i^{(3)}$, $1 \le i \le T_3$. */
output: $\mathbf{m}^{(3)} = (m_1^{(3)}, \ldots, m_{T_3}^{(3)})^T$

for $u \in V^*$ do
    for $v \in N^*(u)$ do
        for $w \in N^*(u)$ and $w \succ v$ do
            /* induced($G^*$, $\Gamma$) returns the CIS with the node set $\Gamma$ of $G^*$. */
            $s \leftarrow$ induced($G^*$, $\{u, v, w\}$);
            if $(w, v) \in E^*$ and $u \succ v$ then
                continue();
            end
            /* $M(s)$ is the motif class ID of $s$. */
            $i \leftarrow M(s)$;
            $m_i^{(3)} \leftarrow m_i^{(3)} + 1$;
        end
    end
end

4.2 4-node CIS Enumeration

Algorithm 3 shows our 4-node CIS enumeration method. To enumerate 4-node CISes, we "pivot" each node $u$ as follows: For each pair of $u$'s neighbors $v$ and $w$ where $w \succ v$, we compute the neighborhood of $u$, $v$, and $w$, defined as $\Gamma = N^*(u) \cup N^*(v) \cup N^*(w) - \{u, v, w\}$. For any node $x \in \Gamma$, we observe that the induced graph $s$ consisting of nodes $u$, $v$, $w$, and $x$ is a 4-node CIS. Thus, we enumerate each node $x$ in $\Gamma$, and update the 4-node CIS consisting of $u$, $v$, $w$, and $x$. We repeat this process until all pairs of $u$'s neighbors $v$ and $w$ are enumerated and processed.

Similar to 3-node CISes, some 4-node CISes might be enumerated and counted more than once when we "pivot" each node $u$ as above. To solve this problem, we propose the following method for making sure each 4-node CIS $s$ is enumerated and gets counted once and only once: When $(u, x) \in E^*$ and $w \succ x$, we discard $x$. Otherwise, denote by $\hat{s}$ the associated undirected graph of $s$ obtained by discarding edge labels. When $\hat{s}$ includes one and only one node $u$ having at least 2 neighbors in $V(\hat{s})$, we let $u$ be the "responsible" node of $s$. For example, the node 4 is the "responsible" node of the first subgraph in Fig. 7. When $\hat{s}$ includes more than one node having at least 2 neighbors in $V(\hat{s})$, we let the node with lowest order among the nodes having at least 2 neighbors in $V(\hat{s})$ be the "responsible" node of $s$. For example, the nodes 6 and 3 are the "responsible" nodes of the second and third subgraphs in Fig. 7.



Figure 7: Examples of "responsible" nodes of 4-node CISes. The graphs shown are the CISes' associated undirected graphs, and the number next to a node represents the node order. Red nodes are "responsible" nodes.
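The 4-node pivoting rule and its "responsible node" test can be sketched for unlabeled undirected graphs as below, together with a brute-force counter for cross-checking on small inputs. The motif names, the classification by degree sequence, the integer node order, and all function names are assumptions for this illustration rather than the paper's motif IDs.

```python
from collections import Counter
from itertools import combinations

# The six connected 4-node graphs, keyed by their sorted degree sequences.
MOTIF_BY_DEGREES = {
    (1, 1, 2, 2): "path", (1, 1, 1, 3): "star", (2, 2, 2, 2): "cycle",
    (1, 2, 2, 3): "tailed-triangle", (2, 2, 3, 3): "diamond", (3, 3, 3, 3): "clique",
}

def build_adj(edges):
    """Build a node -> neighbor-set dictionary from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def inner_degrees(adj, nodes):
    """Degree of each node inside the subgraph induced by `nodes`."""
    return {x: sum(1 for y in nodes if y in adj[x]) for x in nodes}

def count_4node_cises(adj):
    counts = Counter()
    for u in adj:                                      # pivot u
        for v, w in combinations(sorted(adj[u]), 2):   # neighbor pairs, v before w
            gamma = (adj[u] | adj[v] | adj[w]) - {u, v, w}
            for x in gamma:
                if x in adj[u] and w > x:
                    continue   # three neighbors of u: keep only the ordered triple
                deg = inner_degrees(adj, (u, v, w, x))
                lam = [y for y, d in deg.items() if d >= 2]
                if len(lam) >= 2 and u != min(lam):
                    continue   # u is not the responsible node
                counts[MOTIF_BY_DEGREES[tuple(sorted(deg.values()))]] += 1
    return counts

def brute_force_4node(adj):
    """Reference count: test every 4-node subset (only viable for tiny graphs)."""
    counts = Counter()
    for nodes in combinations(sorted(adj), 4):
        key = tuple(sorted(inner_degrees(adj, nodes).values()))
        if key in MOTIF_BY_DEGREES:   # disconnected subsets have other sequences
            counts[MOTIF_BY_DEGREES[key]] += 1
    return counts

adj = build_adj([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4),
                 (4, 5), (5, 2), (1, 4), (0, 5)])
```

Comparing `count_4node_cises(adj)` with `brute_force_4node(adj)` on small graphs is a cheap way to confirm that every CIS is counted once and only once.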


Algorithm 3: 4-node CIS enumeration.

input : $G^* = (V^*, E^*, L^*)$
/* $m_i^{(4)}$ records the number of CISes in $G^*$ isomorphic to motif $M_i^{(4)}$, $1 \le i \le T_4$. */
output: $\mathbf{m}^{(4)} = (m_1^{(4)}, \ldots, m_{T_4}^{(4)})^T$

for $u \in V^*$ do
    for $v \in N^*(u)$ do
        for $w \in N^*(u)$ and $w \succ v$ do
            $\Gamma \leftarrow N^*(u) \cup N^*(v) \cup N^*(w) - \{u, v, w\}$;
            for $x \in \Gamma$ do
                if $(u, x) \in E^*$ and $w \succ x$ then
                    continue();
                end
                /* induced($G^*$, $\{u, v, w, x\}$) is defined as in Alg. 2. */
                $s \leftarrow$ induced($G^*$, $\{u, v, w, x\}$);
                /* undirected($s$) returns the associated undirected graph of $s$ by discarding edge labels. */
                $\hat{s} \leftarrow$ undirected($s$);
                /* findNodes($\hat{s}$, $t$) returns the set of nodes in $V(\hat{s})$ having at least $t$ neighbors in $V(\hat{s})$. */
                $\Lambda \leftarrow$ findNodes($\hat{s}$, 2);
                if $|\Lambda| \ge 2$ then
                    /* minNodes($\Lambda$) returns the node with the lowest order in $\Lambda$. */
                    if $u \ne$ minNodes($\Lambda$) then
                        continue();
                    end
                end
                $i \leftarrow M(s)$;
                $m_i^{(4)} \leftarrow m_i^{(4)} + 1$;
            end
        end
    end
end

4.3 5-node CIS Enumeration

Algorithm 4 describes our 5-node CIS enumeration method. For a 5-node CIS $s$, we classify it into two types according to its associated undirected graph $\hat{s}$:

• 5-node CIS $s$ with type 1: $\hat{s}$ has at least one node having more than two neighbors in $V(\hat{s})$;

• 5-node CIS $s$ with type 2: $\hat{s}$ has no node having more than two neighbors in $V(\hat{s})$, i.e., $\hat{s}$ is isomorphic to a 5-node line or a circle, i.e., the first or sixth motifs in Fig. 4(c).

We propose two different methods to enumerate these two types of 5-node CISes respectively.

To enumerate 5-node CISes with type 1, we "pivot" each node $u$ as follows: When $u$ has at least three neighbors, we enumerate each combination of three nodes $v, w, x \in N^*(u)$ where $x \succ w \succ v$, and then compute the neighborhood of $u$, $v$, $w$, and $x$, defined as $\Gamma = N^*(u) \cup N^*(v) \cup N^*(w) \cup N^*(x) - \{u, v, w, x\}$. For any node $y \in \Gamma$, we observe that the induced graph $s$ consisting of nodes $u$, $v$, $w$, $x$, and $y$ is a 5-node CIS. Thus, we enumerate each node $y$ in $\Gamma$, and update the associated 5-node CIS consisting of $u$, $v$, $w$, $x$, and $y$. We repeat this process until all combinations of three nodes $v, w, x \in N^*(u)$ are enumerated and processed. Similar to 4-node CISes, we propose the following method to make sure each 5-node CIS $s$ is enumerated and gets counted once and only once: When $(y, u) \in E^*$ and $x \succ y$, we discard $y$. Otherwise, let $\hat{s}$ be the associated undirected graph of $s$, and we then let the node with lowest order among the nodes having more than two neighbors in $V(\hat{s})$ be the "responsible" node. The third and fourth subgraphs in Fig. 8 are two corresponding examples.

To enumerate 5-node CISes with type 2, we "pivot" each node $u$ as follows: When $u$ has at least two neighbors, we first enumerate each pair of $u$'s neighbors $v$ and $w$ where $(v, w) \notin E^*$. Then, we compute $\Gamma_v$, defined as the set of $v$'s neighbors not including $u$ and $w$ and not connecting to $u$ and $w$, that is, $\Gamma_v = N^*(v) - \{u, w\} - N^*(u) - N^*(w)$. Similarly, we compute $\Gamma_w$, defined as the set of $w$'s neighbors not including $u$ and $v$ and not connecting to $u$ and $v$, i.e., $\Gamma_w = N^*(w) - \{u, v\} - N^*(u) - N^*(v)$. Clearly, $\Gamma_v \cap \Gamma_w = \emptyset$. For any $x \in \Gamma_v$ and $y \in \Gamma_w$, we observe that the induced graph $s$ consisting of nodes $u$, $v$, $w$, $x$, and $y$ is a 5-node CIS with type 2. Thus, we enumerate each pair of $x \in \Gamma_v$ and $y \in \Gamma_w$, and update the 5-node CIS consisting of $u$, $v$, $w$, $x$, and $y$. We repeat this process until all pairs of $u$'s neighbors $v$ and $w$ are enumerated and processed. To make sure each CIS $s$ is enumerated and gets counted once and only once, we let the node with lowest order be the "responsible" node when the associated undirected graph $\hat{s}$ of $s$ is isomorphic to a 5-node circle. When $\hat{s}$ is isomorphic to a 5-node line, we let the node in the middle of the line be the "responsible" node. The first and second subgraphs in Fig. 8 are two examples respectively.


Figure 8: Examples of "responsible" nodes of 5-node CISes. The graphs shown are the CISes' associated undirected graphs, and the number next to a node represents the node order. Red nodes are "responsible" nodes.


5. EVALUATION 

In this section, we first introduce our experimental datasets. Then 
we present results of experiments used to evaluate the performance 
of our method, Minfer, for characterizing CIS classes of size k — 
3,4,5. 


5.1 Datasets 

We evaluate the performance of our methods on publicly avail¬ 
able datasets taken from the Stanford Network Analysis Platform 
(SNAP|^ which are summarized in Table We start by evaluat¬ 
ing the performance of our methods in characterizing 3-node CISes 
over million-node graphs: Flickr, Pokec, LiveJournal, YouTube, 
Web-Google, and Wiki-talk, contrasting our results with the ground 
truth computed through an exhaustive method. It is computation¬ 
ally intensive to calculate the ground-truth of 4-node and 5-nodes 
CIS classes in large graphs. For example, we can easily observe 
that a node with degree o? > 4 is included in at least ^d{d —l){d — 

^ WWW. snap. Stanford, edu 


Algorithm 4: 5-node CIS enumeration. 

input : G* = {V\E\L*) 

/-k records the number of CISes in G* 

isomorphic to motif , 1 << T 5 . */ 

output: \ • • •, ^tIY 


/-k The functions findNodes, minNodes, 

induced, and undirected are defined in 
Algorithms and [^. */ 

for u G y * do 

foru G iV*(u) do 

for w G A"* {u) and w y v do 

/-k Enumerate and update CIS s 
with undirected(s) not 
isomorphic to a 5-node line 
and circle. k/ 

for X G A* (u) and x y w do 

F E- A*(u) U A*(u) U N^w) U N^x) - 
{u, V, w, x}; 
for y G r do 

if {y, u) G E* and x y y then 
I continue(); 

end 

s E- induced(G*, {u, u, w, x, y}); 
s E- undirected(s); 

A E- f indNodes(s, 3); 

if |A| > 2 then 

if u minNodes(A) then 
I continue(); 

end 

end 

i E- M(s); 


A5) 


^ + 1 ; 


end 


end 

/-k Enumerate and update s with 
undirected(s) isomorphic to a 
5-node line or circle. */ 

if (u, v) ^ A* then 

Fy^N*{v)-{u,w}-N%u)-N*{w); 

for a; G do 

/-k s with undirected(s) 

isomorphic to a 5-node 
circle. */ 

^ A*(u;)-{u,u}-A*(u)-A*(u); 

for y G do 

if {x,y) G A* and 
u minNodes({u,v,w,x,y}) 

then 

I continue(); 

end 

s E- induced(G*, {u, u, w, x, y}); 
z^M(s); 




^ -f 1; 


end 


end 


end 


end 


end 


end 
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Table 2: Graph datasets used in our simulations, “edges" refers 
to the number of edges in the undirected graph generated by 
discarding edge labels, “max-degree" represents the maximum 
number of edges incident to a node in the undirected graph. 


Graph 

nodes 

edges 

max-degree 

Flickr 1^ 

1,715,255 

15,555,041 

27,236 

Pokec l30[ 

1,632,803 

22,301,964 

14,854 

Live! ournari[2l[ 

5,189,809 

48,688,097 

15,017 

YouTube j2Tr 

1,138,499 

2,990,443 

28,754 

Wiki-Talk 1^ 

2,394,385 

4,659,565 

100,029 

Web-Googlefri 

875,713 

4,322,051 

6,332 

soc-Epinionsl p5| 

75,897 

405,740 

3,044 

soc-Slashdot08 TTsI 

77,360 

469,180 

2,539 

soc-Slashdot09 I18J 

82,168 

504,230 

2,552 

sign-Epinions ll^ 

119,130 

704,267 

3,558 

sign-SlashdotOSnfr^ 

77,350 

416,695 

2,537 

sign-Slashdot09 |16j 

82,144 

504,230 

2,552 

com-DBLP 

317,080 

1,049,866 

343 

com- Amazon!^ 

334,863 

925,872 

549 

p2p-Gnutella084^j 

6,301 

20,777 

97 

ca-GrQc |p^ 

5,241 

14,484 

81 

ca-CondMaVpT) 

23,133 

93,439 

279 

ca-HepTh fl^ 

9,875 

25,937 

65 


2) 4-node CISes and -^d{d—l){d—2){d — 3) 5-node CISes, there¬ 
fore it requires more than 0(10^^) and 0(10^^) operations to enu¬ 
merate the 4-node and 5-node CISes of the Wiki-talk graph, which 
has a node with 100,029 neighbors. Even for a relatively small
graph such as soc-Slashdot08, it takes almost 20 hours to compute
all of its 4-node CISes. For this reason, the experiments
for 4-node CISes are performed on the medium-sized graphs
soc-Epinions1, soc-Slashdot08, soc-Slashdot09, com-DBLP, and
com-Amazon, and the experiments for 5-node CISes are performed on
four relatively small graphs, ca-GrQc, ca-HepTh, ca-CondMat,
and p2p-Gnutella08, where computing the ground truth is feasible.
We also evaluate the performance of our methods for characterizing
signed CIS classes in the graphs sign-Epinions, sign-Slashdot08,
and sign-Slashdot09.
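
To give a feel for the enumeration cost discussed above, the following Python sketch counts only the star-shaped CISes centered at a single node of degree d, i.e., the number of ways to pick 3 or 4 of its neighbors. Treating these binomial coefficients as the lower bounds is an assumption of this sketch, and `star_cis_lower_bounds` is a hypothetical helper name.

```python
from math import comb


def star_cis_lower_bounds(d):
    """Lower bounds on the 4-node and 5-node CISes containing a degree-d node.

    Assumption of this sketch: only CISes formed by the node together with
    3 (respectively 4) of its neighbors are counted.
    """
    four_node = comb(d, 3)  # d(d-1)(d-2)/6
    five_node = comb(d, 4)  # d(d-1)(d-2)(d-3)/24
    return four_node, five_node


if __name__ == "__main__":
    # Maximum degree of Wiki-Talk as listed in Table 2.
    four_node, five_node = star_cis_lower_bounds(100_029)
    print(f"4-node: {four_node:.2e}, 5-node: {five_node:.2e}")
```

For d = 100,029 these two counts are on the order of 10^14 and 10^18 respectively, which illustrates why exhaustive enumeration is impractical for graphs with very high-degree nodes.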

5.2 Error Metric 

In our experiments, we focus on the normalized root mean square
error (NRMSE) to measure the relative error of the estimator $\hat{\omega}_i$ of
the subgraph class concentration $\omega_i$, $i = 1, 2, \ldots$. NRMSE($\hat{\omega}_i$) is
defined as:

$$\mathrm{NRMSE}(\hat{\omega}_i) = \frac{\sqrt{\mathrm{MSE}(\hat{\omega}_i)}}{\omega_i}, \quad i = 1, 2, \ldots,$$

where MSE($\hat{\omega}_i$) is defined as the mean square error (MSE) of an
estimate $\hat{\omega}$ with respect to its true value $\omega > 0$, that is

$$\mathrm{MSE}(\hat{\omega}) = E[(\hat{\omega} - \omega)^2] = \mathrm{var}(\hat{\omega}) + (E[\hat{\omega}] - \omega)^2.$$

We note that MSE($\hat{\omega}$) decomposes into a sum of the variance and
the squared bias of the estimator $\hat{\omega}$. Both quantities are important and need to
be as small as possible to achieve good estimation performance.
When $\hat{\omega}$ is an unbiased estimator of $\omega$, we have MSE($\hat{\omega}$) =
var($\hat{\omega}$), and thus NRMSE($\hat{\omega}_i$) is equivalent to the normalized standard
error of $\hat{\omega}_i$, i.e., NRMSE($\hat{\omega}_i$) = $\sqrt{\mathrm{var}(\hat{\omega}_i)}/\omega_i$. Note that our
metric uses the relative error. Thus, when $\omega_i$ is small, we consider
values as large as NRMSE($\hat{\omega}_i$) = 1 to be acceptable. In all our ex¬


Table 3: Values of $\omega_i$, the concentrations of 3-node undirected
and directed motifs. Flickr, Pokec, LiveJournal, Wiki-Talk, and
Web-Google have $1.35 \times 10^{10}$, $2.02 \times 10^{9}$, $6.90 \times 10^{9}$, $1.2 \times 10^{10}$,
and $7.00 \times 10^{8}$ 3-node CISes respectively. (i is the motif ID.)

i     Flickr     Pokec      LiveJournal  Wiki-Talk  Web-Google
undirected 3-node motifs
1     9.60e-01   9.84e-01   9.55e-01     9.99e-01   9.81e-01
2     4.04e-02   1.60e-02   4.50e-02     7.18e-04   1.91e-02
directed 3-node motifs
1     2.17e-01   1.77e-01   7.62e-02     8.91e-01   1.27e-02
2     6.04e-02   1.11e-01   4.83e-02     4.04e-02   1.60e-02
3     1.28e-01   1.60e-01   3.28e-01     3.91e-03   9.28e-01
4     2.44e-01   1.74e-01   1.14e-01     5.43e-02   3.09e-03
5     1.31e-01   1.91e-01   1.73e-01     5.48e-03   1.92e-02
6     1.80e-01   1.71e-01   2.15e-01     3.88e-03   1.92e-03
7     5.69e-05   7.06e-05   2.74e-05     1.37e-05   4.91e-05
8     6.52e-03   2.49e-03   8.66e-03     1.81e-04   6.82e-03
9     1.58e-03   1.03e-03   1.06e-03     8.42e-05   2.84e-04
10    5.19e-03   1.91e-03   6.63e-03     1.28e-04   2.77e-03
11    6.46e-03   2.03e-03   6.27e-03     8.03e-05   5.98e-03
12    1.07e-02   5.13e-03   9.82e-03     1.78e-04   1.21e-03
13    9.86e-03   3.45e-03   1.26e-02     6.65e-05   2.00e-03

Table 4: NRMSEs of $\hat{\omega}_i$, the concentration estimates of 3-node
undirected motifs for p = 0.01 and p = 0.05 respectively.
(i is the motif ID.)

i     Flickr     Pokec      LiveJournal  Wiki-Talk  Web-Google
p = 0.01
1     1.92e-03   3.26e-03   2.69e-03     5.21e-03   2.93e-04
2     4.56e-02   6.92e-02   1.64e-01     2.67e-01   4.00e-01
p = 0.05
1     2.90e-04   4.10e-04   2.64e-04     6.06e-04   2.92e-05
2     6.90e-03   8.68e-03   1.61e-02     3.11e-02   3.99e-02

periments, we average the estimates and calculate their NRMSEs 
over 1,000 runs. 
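
The error metric of Section 5.2 can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the expectation is replaced by the empirical mean over repeated runs; the function names `nrmse` and `mse_decomposition` are hypothetical and the sample values are made up for the example.

```python
from math import sqrt
from statistics import fmean, pvariance


def nrmse(estimates, true_value):
    """Empirical NRMSE: sqrt(mean squared error) divided by the true value."""
    if true_value <= 0:
        raise ValueError("the true concentration must be positive")
    mse = fmean([(e - true_value) ** 2 for e in estimates])
    return sqrt(mse) / true_value


def mse_decomposition(estimates, true_value):
    """Return (variance, squared bias); their sum equals the empirical MSE."""
    variance = pvariance(estimates)  # population variance over the runs
    bias = fmean(estimates) - true_value
    return variance, bias ** 2


if __name__ == "__main__":
    runs = [0.04, 0.05, 0.06]  # made-up estimates of one concentration
    truth = 0.04
    variance, squared_bias = mse_decomposition(runs, truth)
    print(nrmse(runs, truth), variance + squared_bias)
```

The decomposition makes the role of the two error sources explicit: an estimator can have a small variance and still a large NRMSE if it is biased.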

5.3 Accuracy Results 

5.3.1 Accuracy of inferring 3-node motifs’ concentrations

Table 3 shows the real values of the 3-node undirected and directed
motifs’ concentrations for the undirected graphs and directed
graphs of Flickr, Pokec, LiveJournal, Wiki-Talk, and Web-Google.
Among all 3-node directed motifs, the 7-th motif exhibits the smallest
concentration for all five directed graphs. Here the undirected
graphs are obtained by discarding the edge directions of the
directed graphs. Flickr, Pokec, LiveJournal, Wiki-Talk, and
Web-Google have $1.35 \times 10^{10}$, $2.02 \times 10^{9}$, $6.90 \times 10^{9}$, $1.2 \times 10^{10}$, and
$7.00 \times 10^{8}$ 3-node CISes respectively. Table 4 shows the NRMSEs
of our estimates of 3-node undirected motifs’ concentrations for
p = 0.01 and p = 0.05 respectively. We observe that the NRMSEs
associated with the sampling probability p = 0.05 are about seven to ten
times smaller than the NRMSEs when p = 0.01. The NRMSEs are
smaller than 0.04 when p = 0.05 for all five graphs. Fig. 9 shows
































Figure 9: NRMSEs of the concentration estimates of 3-node directed motifs for (a) p = 0.01 and (b) p = 0.05 respectively (x-axis: motif ID).


Table 5: Values of the concentrations of 3-node signed 
and undirected motifs. Sign-Epinions, sign-Slashdot08, sign- 
Slashdot09 have 1.72 x 10®, 6.72 x 10^, and 7.25 x 10^ 3-node 
CISes respectively, (i is the motif ID.) 


i     sign-Epinions   sign-Slashdot08   sign-Slashdot09
1     6.69e-01        6.58e-01          6.68e-01
2     2.12e-01        2.32e-01          2.25e-01
3     9.09e-02        1.02e-01          9.96e-02
4     2.29e-02        5.86e-03          5.75e-03
5     2.76e-03        9.74e-04          9.34e-04
6     2.49e-03        1.14e-03          1.13e-03
7     3.81e-04        1.80e-04          1.76e-04


Table 6: Values of $\omega_i^{(4)}$, the concentrations of 4-node undirected
motifs. Soc-Epinions1, soc-Slashdot08, soc-Slashdot09,
and com-Amazon have 2.58 x 10^°, 2.17 x 10^°, 2.42 x 10^°, 
and 1.78 x 10® 4-node CISes respectively, (i is the motif ID.) 


i     soc-Epinions1   soc-Slashdot08   soc-Slashdot09   com-Amazon
1     3.24e-01        2.93e-01         2.90e-01         2.10e-01
2     6.15e-01        6.86e-01         6.89e-01         6.99e-01
3     2.78e-03        1.25e-03         1.30e-03         2.37e-03
4     5.45e-02        1.86e-02         1.84e-02         7.69e-02
5     3.01e-03        7.77e-04         8.48e-04         1.05e-02
6     2.25e-04        9.19e-05         9.36e-05         1.55e-03


the NRMSEs of our estimates of 3-node directed motifs’ concentrations
for p = 0.01 and p = 0.05 respectively. Similarly, we
observe that the NRMSEs when p = 0.05 are nearly ten times smaller
than the NRMSEs when p = 0.01. The NRMSE of our estimates
of $\omega_7$ (i.e., the 7-th 3-node directed motif concentration) exhibits
the largest error. Except for $\omega_7$, the NRMSEs of the other motif
concentrations’ estimates are smaller than 0.2 when p = 0.05.
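
The effect of the sampling probability on the 3-node undirected estimates can be checked directly from the values reported in Table 4. The sketch below is only an illustration: the dictionary layout and the name `nrmse_ratios` are assumptions, and the numbers are copied from Table 4.

```python
GRAPHS = ["Flickr", "Pokec", "LiveJournal", "Wiki-Talk", "Web-Google"]

# NRMSEs from Table 4, indexed by sampling probability p and motif ID i.
TABLE4 = {
    0.01: {
        1: [1.92e-03, 3.26e-03, 2.69e-03, 5.21e-03, 2.93e-04],
        2: [4.56e-02, 6.92e-02, 1.64e-01, 2.67e-01, 4.00e-01],
    },
    0.05: {
        1: [2.90e-04, 4.10e-04, 2.64e-04, 6.06e-04, 2.92e-05],
        2: [6.90e-03, 8.68e-03, 1.61e-02, 3.11e-02, 3.99e-02],
    },
}


def nrmse_ratios(table, p_low, p_high):
    """Per-motif, per-graph ratio NRMSE(p_low) / NRMSE(p_high)."""
    return {
        motif: [low / high for low, high in zip(table[p_low][motif], table[p_high][motif])]
        for motif in table[p_low]
    }


if __name__ == "__main__":
    for motif, ratios in nrmse_ratios(TABLE4, 0.01, 0.05).items():
        for graph, ratio in zip(GRAPHS, ratios):
            print(f"motif {motif}, {graph}: {ratio:.1f}x")
```

For these values the ratios fall between roughly 6.6 and 10.2, so increasing p from 0.01 to 0.05 reduces the NRMSE by close to an order of magnitude on every graph.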

Table 5 shows the real values of 3-node signed motifs’ concentrations
for sign-Epinions, sign-Slashdot08, and sign-Slashdot09.
Sign-Epinions, sign-Slashdot08, and sign-Slashdot09 have 1.72 x 
10®, 6.72 X 10^, and 7.25 x 10^ 3-node CISes respectively. Fig. 10 


shows the NRMSEs of our estimates of 3-node signed and undi¬ 
rected motifs’ concentrations for p = 0.05 and p = 0.1 respec- 
Table 7: Values of $\omega_i^{(5)}$, the concentrations of 5-node undirected
motifs. Com-Amazon, com-DBLP, p2p-Gnutella08, ca-GrQc, 
ca-CondMat, and ca-HepTh have 8.50 x 10^, 3.34 x 10^°, 
3.92 X 10^ 3.64 x 10^ 3.32 x 10^ and 8.73 x lO'^ 5-node CISes 
respectively, {i is the motif ID.) 


i     com-Amazon  com-DBLP  p2p-Gnutella08  ca-GrQc  ca-CondMat  ca-HepTh
1     2.9e-2      1.4e-1    2.6e-1          9.8e-2   1.4e-1      2.6e-1
2     7.5e-1      1.8e-1    1.8e-1          5.2e-2   2.2e-1      8.2e-2
3     1.6e-1      4.4e-1    4.6e-1          2.1e-1   4.3e-1      4.4e-1
4     6.0e-3      4.8e-2    1.1e-2          1.0e-1   4.9e-2      6.0e-2
5     2.3e-3      1.1e-3    2.7e-2          1.4e-3   2.1e-3      5.4e-3
6     3.6e-5      5.0e-5    1.4e-3          9.2e-5   1.1e-4      4.1e-4
7     1.5e-2      5.6e-2    2.7e-2          1.1e-1   5.5e-2      6.4e-2
8     3.5e-2      7.9e-2    2.2e-2          1.2e-1   8.0e-2      5.2e-2
9     1.4e-3      4.2e-3    1.4e-3          1.5e-2   7.0e-3      8.4e-3
10    1.7e-4      1.4e-4    1.0e-3          6.5e-4   3.0e-4      8.0e-4
11    7.3e-3      8.1e-3    4.3e-3          2.3e-2   9.9e-3      1.0e-2
12    5.3e-4      6.4e-3    2.8e-4          2.3e-2   4.5e-3      3.6e-3
13    8.2e-5      3.5e-6    7.4e-4          4.5e-6   6.4e-6      3.5e-5
14    3.9e-4      5.2e-4    1.7e-4          2.8e-3   6.6e-4      1.0e-3
15    6.7e-4      2.6e-2    7.6e-5          1.5e-1   5.9e-3      5.3e-3
16    7.1e-4      3.4e-4    1.4e-4          1.4e-3   9.2e-4      4.4e-4
17    3.9e-5      1.1e-5    8.0e-5          4.3e-5   2.9e-5      8.4e-5
18    2.3e-5      4.9e-6    6.0e-6          2.3e-5   8.5e-6      3.0e-5
19    2.4e-4      2.8e-3    1.5e-5          1.9e-2   9.8e-4      5.8e-4
20    5.8e-5      4.2e-4    7.0e-7          8.0e-3   1.4e-4      8.2e-5
21    7.2e-6      7.9e-3    1.5e-8          6.1e-2   1.5e-4      3.2e-3


tively. For all three signed graphs, the NRMSEs are smaller
than 0.9 and 0.2 when p = 0.05 and p = 0.1 respectively.


Except the NRMSEs of the other motif concentrations’ esti¬ 
mates are smaller than 0.2 for p = 0.2 . 




Figure 11: NRMSEs of the concentration estimates of 4-node undirected motifs for (a) p = 0.1 and (b) p = 0.2 respectively.


5.3.3 Accuracy of inferring 5-node motifs’ concentrations

Table 7 shows the real values of $\omega_i^{(5)}$, i.e., the concentrations
of 5-node undirected motifs for com-Amazon, com-DBLP,
p2p-Gnutella08, ca-GrQc, ca-CondMat, and ca-HepTh. Com-Amazon,
com-DBLP, p2p-Gnutella08, ca-GrQc, ca-CondMat, and ca-HepTh
contains 8.50 x 10^ 3.34 x 10^°, 3.92 x 10^ 3.64 x lO'^, 3.32 x 
10^, and 8.73 x 10^ 5-node CISes respectively. Pig. 


shows the NRMSEs of the concentration estimates of 5-node undirected
motifs for p = 0.1, p = 0.2, and p = 0.3 respectively. We observe
that the NRMSE decreases as p increases, and that the 6-th, 10-th, 13-th,
17-th, and 18-th 5-node motifs, which have small $\omega_i^{(5)}$, exhibit large NRMSEs.




Figure 10: NRMSEs of the concentration estimates of 3-node signed and undirected motifs for p = 0.05 and p = 0.1 respectively.

5.3.2 Accuracy of inferring 4-node motifs’ concentrations

Table 6 shows the real values of $\omega_i^{(4)}$, i.e., the concentrations of
4-node undirected motifs for soc-Epinions1, soc-Slashdot08,
soc-Slashdot09, and com-Amazon. Soc-Epinions1, soc-Slashdot08, soc-
Slashdot09, and com-Amazon have 2.58 x 10^°, 2.17 x 10^°, 
2.42 X 10^°, and 1.78 x 10® 4-node CISes respectively. Pig.pr] 
shows the NRMSEs of $\hat{\omega}_i^{(4)}$, the concentration estimates of 4-node
undirected motifs for p = 0.05, p = 0.1, and p = 0.2 respectively.
We observe that motifs with smaller $\omega_i^{(4)}$ exhibit larger NRMSEs.


5.4 Error Bounds 

Figure 13 shows the root CRLBs (RCRLBs) and the root MSEs
(RMSEs) of our estimates of 3-node directed motifs’ concentrations
and of 4- and 5-node undirected motifs’ concentrations, where the graphs
LiveJournal, soc-Epinions1, and com-DBLP are used for studying
3-node directed motifs, 4-node undirected motifs, and 5-node undirected
motifs respectively. We observe that the RCRLBs are smaller than the RMSEs,
and fairly close to them. The RMSEs and RCRLBs are almost
indistinguishable for 3-node directed motifs, where p = 0.01 and
LiveJournal is used. This indicates that the RCRLBs can efficiently
bound the errors of our motif concentration estimates.

6. RELATED WORK 

There has been considerable interest to design efficient sampling 
methods for counting specific subgraph patterns such as triangles 
[^[^[^, cliques |[^[^, and cycles fT^ , because it is computa¬ 
tionally intensive to compute the number of the subgraph pattern’s 
appearances in a large graph. Similar to the problem studied in GH 
in this work we focus on characterizing 3-, 4-, and 
5-nodes CISes in a single large graph, which differs from the prob¬ 
lem of estimating the number of subgraph patterns appearing in a 
large set of graphs studied in pO) . OmidiGenes et al. f22\ pro¬ 
posed a subgraph enumeration and counting method using sam¬ 
pling. However this method suffers from unknown sampling bias. 
To estimate subgraph class concentrations, Kashtan et al. GD pro¬ 
posed a subgraph sampling method, but their method is compu¬ 
tationally expensive when calculating the weight of each sampled 
subgraph, which is needed to correct for the bias introduced by 
Figure 12: NRMSEs of $\hat{\omega}_i^{(5)}$, the concentration estimates of 5-node undirected motifs for p = 0.1, p = 0.2, and p = 0.3 respectively. (Panels include (d) ca-GrQc and (e) ca-CondMat; x-axis: motif ID.)


sampling. To address this drawback, Wernicke (37) proposed an al¬ 
gorithm, FANMOD, based on enumerating subgraph trees to detect 
network motifs. Bhuiyan et al. 0 proposed a Metropolis-Hastings 
based sampling method GUISE to estimate 3-node, 4-node, and 5- 
node subgraph frequency distribution. Wang et al. proposed an 
efficient crawling method to estimate online social networks’ motif 
concentrations, when the graph’s topology is not available in ad¬ 
vance and it is costly to crawl the entire topology. In summary, pre¬ 
vious methods focus on designing efficient sampling methods and 
crawling methods for estimating motif statistics when the graph is 
directly available or indirectly available (i.e., it is not expensive to 
query a node’s neighbors p6)). They cannot be applied to solve the 
problem studied in this paper, in which the graph is not available
but a RESampled graph is given, and the aim is to infer the underlying
graph’s motif statistics from the RESampled graph. Finally,
we would like to point out that our method of estimating motif statistics
and its error bound computation method are inspired by methods of
estimating flow size distribution for network traffic measurement
and monitoring |[8l [24l[^[^ . 


7. CONCLUSIONS 

To the best of our knowledge, we are the first to study the problem
of inferring the underlying graph’s motif statistics when the entire
graph topology is not available and only a RESampled graph
is given. We propose a model to bridge the gap between the underlying
graph’s and its RESampled graph’s motif statistics. Based on
this probabilistic model, we develop a method, Minfer, to infer the
underlying graph’s motif statistics, and give a Fisher-information-based
method to bound the error of our estimates. Experimental
results on a variety of known data sets validate the accuracy of
our method.
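
To make the idea of bridging the sampled and the underlying motif statistics concrete, the following Python sketch works through the smallest case, 3-node undirected motifs. It is a minimal illustration under the assumption that every edge is kept independently with probability p; the function names are hypothetical, and the sketch is not a reproduction of the Minfer implementation. Under this assumption an open wedge survives with probability p^2, a triangle survives intact with probability p^3, and a triangle turns into an open wedge with probability 3p^2(1 - p).

```python
def expected_sampled_counts(n_wedge, n_triangle, p):
    """Expected wedge and triangle counts in the sampled graph."""
    m_wedge = p**2 * n_wedge + 3 * p**2 * (1 - p) * n_triangle
    m_triangle = p**3 * n_triangle
    return m_wedge, m_triangle


def infer_original_counts(m_wedge, m_triangle, p):
    """Invert the 2x2 triangular system to recover the underlying counts."""
    n_triangle = m_triangle / p**3
    n_wedge = (m_wedge - 3 * p**2 * (1 - p) * n_triangle) / p**2
    return n_wedge, n_triangle


def concentrations(n_wedge, n_triangle):
    """Normalize counts into motif concentrations that sum to one."""
    total = n_wedge + n_triangle
    return n_wedge / total, n_triangle / total


if __name__ == "__main__":
    p = 0.05
    sampled = expected_sampled_counts(960.0, 40.0, p)  # made-up counts
    recovered = infer_original_counts(*sampled, p)
    print(recovered, concentrations(*recovered))
```

In practice the observed counts in a sampled graph fluctuate around these expectations, which is why the accuracy of the inferred concentrations depends on p, as the experiments in Section 5.3 show.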


Appendix 


The matrices $P$ and $P^{-1}$ are shown in Fig. 14.

Figure 14: The matrices $P$ and $P^{-1}$.